
IGNITE-22662 : Snapshot check as distributed process #11391

Conversation

Vladsz83 (Contributor)

No description provided.

@Vladsz83 Vladsz83 changed the title Check snapshot as distributed process IGNITE-22662 : Snapshot check as distributed process Jul 4, 2024
@Vladsz83 Vladsz83 changed the base branch from master to IGNITE-22662__snapshot_refactoring July 10, 2024 10:29
IgniteSnapshotManager snpMgr = kctx.cache().context().snapshotMgr();

if (allRestoreHandlers) {
workingFut = CompletableFuture.supplyAsync(() -> {
Reviewer (Member):

We should specify the snapshot executor for the async job

Author (Vladsz83, Contributor):

I'm afraid we can't, because the same executor is used somewhere inside the task. I tried it. The executor might be configured with a single thread; that thread gets blocked waiting for the task, leaving no thread for the workers. Tests like testChangeSnapshotTransferRateInRuntime() hang.
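
For illustration, a minimal generic Java sketch of the deadlock described above (hypothetical class and variable names, not Ignite code): a single-thread pool runs an outer job that blocks waiting on an inner job submitted to the same pool.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class SingleThreadDeadlockSketch {
    public static void main(String[] args) {
        // Assume the snapshot pool is configured with a single thread.
        ExecutorService snapshotExec = Executors.newFixedThreadPool(1);

        CompletableFuture<String> outer = CompletableFuture.supplyAsync(() -> {
            // The only pool thread is occupied by this outer job...
            CompletableFuture<String> inner =
                CompletableFuture.supplyAsync(() -> "checked", snapshotExec);

            // ...and blocks waiting for the inner job, which can never be scheduled.
            return inner.join();
        }, snapshotExec);

        outer.join(); // Hangs: no free thread is left for 'inner'.
    }
}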

void interrupt(Throwable err) {
contexts.forEach((snpName, ctx) -> {
if (ctx.fut != null)
ctx.fut.onDone(err);
Reviewer (Member):

It is possible that interrupt will not interrupt anything. There are still time windows when:

  1. The first phase has not started yet and the futures are null.
  2. The first phase has finished, the second phase has not started yet, and the futures are null.

I again recommend creating a future that lives on every node for the whole process, with the phase futures listening to it.
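
A rough, generic sketch of the suggested pattern (hypothetical names, not code from the PR): one future lives on the node for the whole process, and every phase future is chained to it, so interruption also covers the gaps between phases.

import java.util.concurrent.CompletableFuture;

class ProcessLifetimeFutureSketch {
    /** Created when the distributed process starts and lives until it finishes. */
    private final CompletableFuture<Void> processFut = new CompletableFuture<>();

    /** Chains a phase future to the process future. */
    <T> CompletableFuture<T> registerPhase(CompletableFuture<T> phaseFut) {
        processFut.whenComplete((ignored, err) -> {
            if (err != null)
                phaseFut.completeExceptionally(err);
        });

        return phaseFut;
    }

    /** Interrupts the whole process, regardless of which phase (if any) is currently running. */
    void interrupt(Throwable err) {
        processFut.completeExceptionally(err);
    }
}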

Author (Vladsz83, Contributor), Aug 7, 2024:

interrupt() means the node is stopping. SnapshotManager stops and its thread pool stops. Discovery should not work, so the process phases should not start and should not be able to work because SnapshotManager has stopped. No problem is expected. Also, keeping a single future would require resetting it, which brings the same race. Cancelling a finished future does nothing; it does not even store the exception or a cancelled flag. A reset() would have to revive the future and restart it.

if (ctx.fut != null)
ctx.fut.onDone(err);

it.remove();
Reviewer (Member):

Why do you remove the context here? The reduce phase is already responsible for cleanup.

Author (Vladsz83, Contributor):

Because we do not store the error any more, we cannot tell at the reduce phase that we should not work. The phase does not run if there is no context; if there is one, it runs. Also because we stop the future here.

SnapshotCheckContext ctx;

// The context can be null, if a required node leaves before this phase.
if (!req.nodes().contains(kctx.localNodeId()) || (ctx = context(null, req.requestId())) == null || ctx.locMeta == null)
Reviewer (Member):

Does ctx.locMeta == null cover the case !req.nodes().contains(kctx.localNodeId())?

Author (Vladsz83, Contributor):

No. We've added the client-initiator to the required nodes, and it has no meta. Previously, the required nodes were only data nodes. Also, a snapshot might be distributed across the baseline nodes in any manner and/or be restored from another cluster, so any node may lack the snapshot meta. The test testRestoreFromAnEmptyNode() shows a case like that.

*
* @param err The interrupt reason.
*/
void interrupt(Throwable err) {
Reviewer (Member):

This method is invoked after IgniteSnapshotManager#busyLock is acquired, but the snapshot check doesn't check this lock. It looks like none of these collections are synchronized with node stopping.
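
For reference, a generic sketch of the busy-lock guard pattern the comment refers to (plain Java, hypothetical names, not the actual IgniteSnapshotManager code): operations enter the lock before touching shared state, and node stop waits for them to leave before cleaning up.

import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class BusyLockGuardSketch {
    private final ReadWriteLock busyLock = new ReentrantReadWriteLock();

    private volatile boolean stopped;

    /** Called by an operation before it touches shared state; returns false if the node is stopping. */
    boolean enterBusy() {
        busyLock.readLock().lock();

        if (stopped) {
            busyLock.readLock().unlock();

            return false;
        }

        return true;
    }

    /** Called by an operation when it is done with shared state. */
    void leaveBusy() {
        busyLock.readLock().unlock();
    }

    /** Called on node stop; waits until all in-flight operations leave before cleaning up. */
    void onStop(Throwable err) {
        busyLock.writeLock().lock();

        try {
            stopped = true;
            // Complete outstanding phase futures with 'err' and clear the collections here.
        }
        finally {
            busyLock.writeLock().unlock();
        }
    }
}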

Author (Vladsz83, Contributor), Aug 15, 2024:

This method should be renamed to 'onStop()'. Everything is stopping: SnapshotManager is stopping, the thread pools are stopping. Discovery should not accept or process messages. No other check process can start, and even if one starts, it cannot work because the thread pools are stopping. No problems are expected.


// A not required node can leave the cluster and its result can be null.
return results.entrySet().stream()
.filter(e -> requiredNodes.contains(e.getKey()) && e.getValue() != null)
Reviewer (Member):

It looks like we already check requiredNodes in an assert in reduceValidatePartsAndFinish?

Author (Vladsz83, Contributor):

No. The results can contain nulls and can come from non-required, non-data nodes. NPEs would arise.

boolean skipPartsHashes
) {
try {
return checkPartitions(meta, snpDir, groups, forCreation, checkParts, skipPartsHashes).get();
Reviewer (Member):

Same as above: let the invoker call get() with a timeout.
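
For illustration, a generic form of what is suggested (hypothetical names, plain Java; the actual future type in the PR may differ): the caller bounds the wait instead of blocking indefinitely.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimedGetSketch {
    static <T> T awaitWithTimeout(CompletableFuture<T> fut, long timeoutSec) throws Exception {
        try {
            // Bounded wait: fails fast instead of blocking the caller forever.
            return fut.get(timeoutSec, TimeUnit.SECONDS);
        }
        catch (TimeoutException e) {
            throw new IllegalStateException("Snapshot partition check timed out after " + timeoutSec + "s.", e);
        }
    }
}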

Author (Vladsz83, Contributor):

We have no timeout on snapshot validation. The user cannot define one and can't even cancel it yet. There is no value to pass.

catch (IgniteCheckedException e) {
throw new IgniteException("Failed to check partitions of snapshot '" + meta.snapshotName() + "'.", e);
}
});
Reviewer (Member):

snapshot executor?

Author (Vladsz83, Contributor):

Nope. The check functions use the same executor. If it is configured with just 1 thread (setSnapshotThreadPoolSize(1)), we'll freeze here. Tests like testChangeSnapshotTransferRateInRuntime() would hang.

@timoninmaxim timoninmaxim merged commit 8ef9bcf into apache:IGNITE-22662__snapshot_refactoring Aug 22, 2024
@Vladsz83 Vladsz83 deleted the checkSnpAsDistrProc branch August 22, 2024 15:42